Artificial Neural Network: A New Approach for QSAR Study
Parimal M. Prajapati1, Yatri R. Shah2 and Dhrubo Jyoti Sen3
1I. K. Patel College of Pharmaceutical Education and Research, Samarth Campus, Opp. Sabar Dairy, Himmatnagar-383001, Sabarkantha, Gujarat
2Shree H. N. Shukla Institute of Pharmaceutical Education and Research, Behind Marketing Yard, Nr. Lalpari Lake, Amargadh (Bhichari), Rajkot, Gujarat
3Department of Pharmaceutical Chemistry, Shri Sarvajanik Pharmacy College, Hemchandracharya North Gujarat University, Arvind Baug, Mehsana-384001, Gujarat, India
INTRODUCTION:
An
artificial neural network (ANN), usually called "neural network"
(NN), is a mathematical model or computational model that tries to simulate the
structure and/or functional aspects of biological neural networks. It consists of
an interconnected group of artificial
neurons and processes information using a connectionist
approach to computation. In most cases an ANN is an adaptive
system that changes its structure based on external or internal
information that flows through the network during the learning phase. Neural
networks are non-linear statistical data modeling
tools. They can be used to model complex relationships between inputs and outputs
or to find patterns in data1.
The role of the medicinal chemist has remained essentially unchanged for the
past 50 years: a quest for rapid and efficient methods that optimize biological
activity through structural variation. This need is historically driven by the
fact that, on average, approximately 10,000 compounds are prepared and
evaluated for every one that becomes a marketable drug. The current cost of
drug development is nearly $600 million, and increased development time has
shortened the useful patent life in which companies can recover their costs;
one-third of these costs are estimated to occur in the lead generation,
discovery, and optimization phase.
The dramatic increase in the cost of discovery resources is highlighted by the
fact that a single, traditionally synthesized compound is estimated to cost
$6,000 per research-size sample2. These factors have contributed to a paradigm
shift in the way pharmaceutical research is conducted; companies are adopting
approaches that reduce costs early in the drug development process.
Figure-1:
Artificial Neural Network
Combinatorial chemistry and high-throughput screening
(HTS) are exciting techniques that are being adopted by the pharmaceutical and
agrochemical industries in an effort to reduce costs and shorten discovery and
optimization time. Computational scientists are contributing to this effort
through combinatorial chemistry library analysis, diversity analysis, and
quantitative structure activity relationship (QSAR) studies. QSAR studies rely
heavily upon statistics to derive mathematical models which relate the
biological activity of a series of compounds to one or more properties of the
molecules. These properties, or descriptors, may be derived from numerous
sources, including refractive index, octanol/water partition coefficient, or
spectral data. In cases where experimental values for these properties are not
available, several programs can compute them: the popular CLOGP program3 can be
used for the computation of octanol/water partition coefficients; alternatively,
theoretical properties may be obtained from computational programs such as
MOPAC4. A plethora of graph theory-based topological descriptors are available
from programs such as MOLCONN-X5. Extensive lists of substituent parameters
describing electronic (sigma), lipophilic (pi), and steric (CMR and Taft
coefficients) properties are also available. The initial phase of a QSAR study
requires the collection of many of these descriptors prior to model building6.
Figure-2: Neural
Network
The seminal work in the field of QSAR was reported by Hansch et al., who
demonstrated the use of regression analysis for model building7. In the years
since Hansch introduced regression analysis to chemistry, other methods have
been developed and explored to circumvent some of the problems associated with
this technique.
The success of regression analysis in QSAR model
building depends upon an assumed linear relationship between the biological
activity and one or more descriptors. As the number of descriptors increases,
however, regression analysis becomes problematic. One problem likely to occur
in large descriptor sets, for example, is redundancy in information when
descriptors are correlated. Latent variable techniques have become accepted
methods of addressing this issue8. These techniques include
the use of principal components in regression analysis and the method of
partial least squares. A second problem encountered in using regression
analysis is the a priori assumption of a model form (i.e. quadratic,
cubic, use of cross terms, etc.). In order to address this issue, variable
selection techniques such as stepwise forward and stepwise backward multiple
linear regression analysis (MLR) were introduced. One recurrent problem in all
of these methods is that, because computational methods are used to generate
descriptors, a modern dataset may contain more descriptors than compounds, that
is, more columns of parameters than rows of compounds. This results in the
insidious problem described by Topliss and Edwards: the observed correlations
may be chance correlations9.
Figure-3: QSAR study
Intriguing approaches using machine learning methods
have been under study in the field of chemistry for the past decade. The first
description of a simple neural network was provided in 194310. Interest in
neural networks grew slowly until the 1980s, when new computer architectures
and learning algorithms began to appear. The use of artificial neural networks
(ANN) in all fields has since grown substantially. In 1988, Hoskins et al.
reported the first use of ANN for process control in chemistry11. This was
followed by two
reports using ANN for prediction of protein secondary structure. The use of ANN
in chemistry has further expanded into the analysis of spectral data,
pharmaceutical product development, classification of anticancer compounds,
prediction of chemical reactivity, physical properties, electrostatic
potential, ionization potentials as well as QSARs12.
Neural networks are part of a new era of evolving
computer technology in which a computer system has been designed to learn from
data in a manner emulating the learning pattern in the brain. Neural networks
are typically used when there are a large number of observations and when the
problem is not understood well enough to write a procedural program or expert
system. Using neural networks, the solution to the problem is sought as
follows:
1. An answer is calculated by multiplying each input by the connection weight;
2. Products are summed at each hidden unit, where a non-linear transfer function is applied; and
3. The output of each hidden unit is then multiplied by the connection weight, summed, and interpreted.
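As a concrete illustration, the three steps can be sketched in a few lines of Python; the two-input, three-hidden-unit shape echoes Figure 4, and all weight values here are invented for illustration.

```python
import math

def forward(inputs, hidden_weights, output_weights):
    """Steps 1-3: weight the inputs, sum and transform at each hidden
    unit (tanh as the non-linear transfer function), then weight and
    sum the hidden outputs to produce the network's answer."""
    hidden_out = []
    for weights in hidden_weights:                           # one row per hidden unit
        total = sum(x * w for x, w in zip(inputs, weights))  # steps 1-2: weighted sum
        hidden_out.append(math.tanh(total))                  # step 2: non-linear transfer
    # step 3: multiply each hidden output by its connection weight, sum, interpret
    return sum(h * w for h, w in zip(hidden_out, output_weights))

# 2 inputs and 3 hidden units, as in Figure 4 (weights are arbitrary)
y = forward([0.2, 0.7],
            [[0.1, -0.4], [0.3, 0.2], [-0.5, 0.6]],
            [0.8, -0.2, 0.5])
```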
The neural network "learns" by repeatedly
passing through the data and adjusting its connection weights to minimize the
error; in this case, the predicted, versus the actual biological activity. A
neural network is thus a mathematical model to describe a non-linear hyper
surface. The increasing interest and availability of neural network software
has prompted several groups to apply this technology in QSAR studies. Neural
networks have been applied as a substitute for discriminant analysis and have
been used in QSAR in a manner similar to multiple regression analysis. A
comparative study of neural networks and regression analysis using a set of
dihydrofolate reductase inhibitors indicated that neural networks were superior
to regression analysis in providing accurate predictions, but that the design
of the neural net was critical to obtaining these results13.
First, the design of the network is critical with respect to the number of
hidden units involved. The network will overfit, or memorize, the data if too
many hidden units are used. Conversely, the network will fail to generalize and
become unstable if too few hidden units are used. The second factor which must
be considered is the length of the training time. It is possible that networks
may be overstrained, and thus destabilized, through the use of excessive
training periods. Third, the selection of an appropriate test set and training
set is important. The training set should adequately represent the entire
dataset and be sufficiently large in order to properly train the neural
network. The test set should fall within the domain covered by the neural
network model; in addition, it should be large enough to provide for an
assessment of the model. Finally, the results obtained from neural networks
can be difficult to interpret and apply to the drug design problem. This issue
is especially troublesome for the medicinal chemist who is not an expert in the
use and interpretation of neural network technology.
We have reported our initial results in the use of
neural networks to identify the descriptors most relevant to biological
activity. The present report describes our continued work in this area and the
enhancements we have made in the methodology. Our objective is to provide
additional tools for the medicinal chemist in the area of molecular design. Our
focus is to apply neural network technology early in the development of the SAR
in a manner that, for the medicinal chemist, is easy to use. A further goal is
to provide a technique whose results are both relevant and interpretable.
Methodology has been developed and incorporated within a program, named AUTONET,
that represents a self-training neural network. Results from the neural network
are presented visually in order to rapidly and easily convey to the medicinal
chemist the important features derived by the neural network14.
Figure-4: Sample neural network showing connections for 2 inputs and 3 hidden units
METHODS:
An artificial neural network consists of a number of
"neurons" or "hidden units" that receive data from the
outside, process the data, and output a signal. A "neuron" is
essentially a regression equation with a non-linear output. When more than one
of these neurons is used, non-linear models can be fitted. These networks have
been shown to work well for modeling a number of different problems, including
QSAR. Neural networks are known for their ability to model a wide set of functions
without knowing the model a priori. The back-propagation network
receives a set of inputs which are multiplied by each neuron's weights (Figure
4). These products are summed for each neuron and a non-linear transfer
function is applied. The bias has the effect of shifting the transfer function
to the left or right. The transformed sums are then multiplied by the output
weights where they are summed a final time, transformed, and interpreted. Since
a back-propagation network is a supervised method, the desired output must be
known for each input vector so an error (the difference between the desired
output and the network's predicted output) can be calculated. This error is
propagated backwards through the network (thus the name), adjusting the weights
so that the next time the network sees the same input pattern, it will come
closer to the desired output. The patterns are shown many times until the
network either learns the relation or determines that there is none15.
METHODOLOGY: For our purposes, the input vector and output values
were normalized between 0.1 and 0.9 by column. This ensures that no
exceptionally large valued descriptors will have an undue effect on the
network. All of the connection weights are initialized to very small random
numbers (+/- 0.0005). This is necessary so each hidden unit will respond to a
slightly different feature in the input vector. Each hidden unit outputs the
hyperbolic tangent of the sum of the products of the inputs and the weights
(Equation 1):

hj = tanh(Σi wij xi)    (1)

The hyperbolic tangent function has a range of -1 to 1, with the highest gain
near 0. This compresses the output of the unit, which defines a maximum
contribution for each hidden unit. The output unit takes the sum of the
products of the hidden units and the weights (Equation 2):

x = Σj vj hj    (2)

and applies the transfer function (Equation 3):

out = 1 / (1 + e^-x)    (3)

where x is the value of the output unit. This function has minimum and maximum
values of 0 and 1, respectively. Since the output values were normalized
between 0.1 and 0.9, this allows the network to slightly exceed the minimum and
maximum values that were given in the original data file16.
Once the output is calculated, it is compared to the desired output value for
that particular vector (the biological activity). An error, ε, is calculated
according to Equation 4 and is used in a gradient descent algorithm to adjust
the weights of the network (Equation 5):

ε = experimental - predicted    (4)

Δvj = η ε out(1 - out) hj    (5)

where η is the learning rate, which controls the step size of the gradient
descent algorithm. The learning rate is typically between 0 and 1 and is
decreased during training as the solution is reached. The term out(1 - out) is
the derivative of the transfer function (Equation 3). The hidden unit weights
are adjusted in a similar manner (Equation 6):

Δwij = η ε out(1 - out) vj (1 + hj)(1 - hj) xi    (6)

The term (1 + hj)(1 - hj) is the derivative of the hyperbolic tangent transfer
function.
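A minimal Python sketch of one training update following Equations 1-6 is given below. The network size, input vector, target, and learning rate are illustrative assumptions, not values taken from AUTONET; the ±0.0005 weight initialization follows the description above.

```python
import math
import random

def sigmoid(x):
    # Equation 3: logistic transfer function with range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, W, v, lr=0.6):
    """One gradient-descent update over Equations 1-6.
    W holds the hidden-unit weights (one row per unit), v the output weights."""
    # Equation 1: tanh of the weighted input sum, per hidden unit
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in W]
    # Equations 2-3: weighted sum of hidden outputs through the transfer function
    out = sigmoid(sum(vj * hj for vj, hj in zip(v, h)))
    err = target - out                                   # Equation 4
    v_old = list(v)
    for j in range(len(v)):                              # Equation 5
        v[j] += lr * err * out * (1 - out) * h[j]
    for j, row in enumerate(W):                          # Equation 6
        for i, xi in enumerate(x):
            row[i] += lr * err * out * (1 - out) * v_old[j] * (1 + h[j]) * (1 - h[j]) * xi
    return abs(err)

# weights initialized to very small random numbers, as described above
random.seed(0)
W = [[random.uniform(-0.0005, 0.0005) for _ in range(2)] for _ in range(3)]
v = [random.uniform(-0.0005, 0.0005) for _ in range(3)]
errors = [train_step([0.3, 0.7], 0.9, W, v) for _ in range(200)]
```

With such tiny initial weights the error shrinks only slowly at first, which is exactly why the article trains for many epochs.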
Each input vector and desired output pair for the
entire training set was presented to the network and the weights were adjusted.
The training set was generated by sorting all the data based on the output,
biological activity, and then every fourth compound was placed in a testing set
and the remaining compounds were used for the training set. Sorting the small
datasets that are typical of QSAR studies ensured that the test set was as
representative as possible. One complete cycle through the data is called an
epoch. During each epoch, the order in which the compounds were presented was
randomized. This procedure improved the overall performance of the neural network.
The training and testing errors were calculated (Equation 7) for the testing
set every 10 epochs and this value was saved with the network weights.
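This bookkeeping around the training loop (testing error checked every 10 epochs, best weights saved, training rolled back when improvement stalls for 250 epochs) can be sketched as follows; `train_one_epoch` and `testing_error` are hypothetical stand-ins for routines the article does not list.

```python
import copy

def train_with_rollback(weights, train_one_epoch, testing_error,
                        check_every=10, patience=250, max_epochs=5000):
    """Check the testing error every `check_every` epochs, saving the
    weights at each new best; stop and return the best saved weights
    once no improvement has been seen for `patience` epochs."""
    best_err = float("inf")
    best_weights = copy.deepcopy(weights)
    last_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(weights)
        if epoch % check_every == 0:
            err = testing_error(weights)
            if err < best_err:
                best_err = err
                best_weights = copy.deepcopy(weights)
                last_improvement = epoch
            elif epoch - last_improvement >= patience:
                break                    # return to the best saved weights
    return best_weights, best_err

# toy stand-ins: the "error" improves until epoch 100, then worsens
state = [0.0]
def step(w): w[0] += 1
def error(w): return abs(w[0] - 100)
best_w, best_e = train_with_rollback(state, step, error)
```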
If the testing error had not decreased in 250 epochs,
the network was returned to that set of weights. Often this technique will
produce a large number of networks of different sizes that have similar
training and testing errors. The best network is the one with the smallest
testing error. The r2 value is also checked since it is possible to
have a small error and a poor r2 value. If there are several
networks of similar errors, the smallest network is often the easiest to
interpret. The model knowledge in the neural network can be discerned by
examining the weights. As the weight from an input descriptor to a hidden unit
approaches zero, then the effect that the descriptor can have on the model
approaches zero. However, it was not obvious which descriptors were
contributing to the model and which only had chance effects since every
descriptor had a weight coefficient. In order to establish cutoff criteria, three
random descriptors were included as input vectors. The networks were trained
with these random descriptors along with the other descriptors. Now it was
possible to compare the various chemical descriptors against the random
descriptors to determine which descriptors were more significant than random
noise. A second set of networks was trained with a reduced set of descriptors.
Only those descriptors whose absolute value of the weight coefficient was
larger than the largest of the random descriptors were used. In addition the
descriptors from the best (lowest testing error) network were automatically
included. Often, the networks using this "reduced" set of descriptors
will outperform the original set. This increases the likelihood that a
difficult model can be solved and also that an easy-to-interpret network will
be constructed. Non-linear effects are also determined. An examination of the
weights for a descriptor where the largest weight and the second largest are of
opposite sign and are at least half the magnitude of the largest weight in the
network suggests the presence of a non-linear effect. This tends to identify
non-linear effects with fewer compounds than otherwise possible17.
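The cutoff procedure can be sketched as follows: a descriptor survives only if its largest absolute hidden-unit weight exceeds the largest absolute weight found on any of the random-noise descriptors. The descriptor names and weight values here are invented for illustration.

```python
def significant_descriptors(weights, random_names):
    """weights maps each descriptor name to its hidden-unit weight
    coefficients; a descriptor is kept only if its largest |weight|
    beats every weight on the known random-noise descriptors."""
    cutoff = max(abs(w) for name in random_names for w in weights[name])
    return [name for name, ws in weights.items()
            if name not in random_names
            and max(abs(w) for w in ws) > cutoff]

# hypothetical weight coefficients for two real and two random descriptors
weights = {
    "CLOGP": [0.9, 0.7, 0.8],
    "ATCH4": [0.6, -0.5, 0.4],
    "RAND1": [0.1, -0.2, 0.05],
    "RAND2": [-0.3, 0.15, 0.1],
}
kept = significant_descriptors(weights, ["RAND1", "RAND2"])
```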
In order to present the chemist with useful information
from the neural network, certain data are visualized. The hidden unit weights
for each descriptor for each network are displayed in a color map. A green
color indicates that a weight value is near zero, blue is a negative weight,
and red is positive. If an output weight is negative, all of the weights
entering that particular hidden unit are multiplied by (-1). This can be done
because the hyperbolic tangent is an odd function, so negating both a hidden
unit's incoming weights and its output weight leaves the model unchanged. If a
descriptor has all red weights, then increasing the value of that descriptor
will have a positive effect on the output of the network. The chemist thus can
quickly see which descriptors are consistently having an effect on the
different models. The numerical value of the hidden unit weights is also
available. The chemist can also examine the weights of the best network by
testing error, or if there are several that are close, the smallest hidden unit
size. Each network is given a brightness which is proportional to its testing
error. Random descriptors are given colors as well. The chemist may opt to use
these descriptors in a multiple regression study.
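The color mapping can be sketched as follows: weights near zero map to green, positive weights shade toward red, negative toward blue, and a negative output weight flips the signs of that unit's incoming weights first. The linear shading scheme is an illustrative assumption.

```python
def weight_color(w, max_abs):
    """Map a weight to an (r, g, b) triple: green near zero, shading to
    red for positive weights and blue for negative ones."""
    t = min(abs(w) / max_abs, 1.0) if max_abs else 0.0
    if w >= 0:
        return (t, 1.0 - t, 0.0)         # green -> red
    return (0.0, 1.0 - t, t)             # green -> blue

def color_map(hidden_weights, output_weights):
    """Color the weights entering each hidden unit. If a unit's output
    weight is negative, its incoming weights are multiplied by -1 first,
    which leaves the tanh-based model unchanged."""
    rows = []
    for unit, v in zip(hidden_weights, output_weights):
        signed = [w if v >= 0 else -w for w in unit]
        scale = max(abs(w) for w in signed)
        rows.append([weight_color(w, scale) for w in signed])
    return rows

# two hidden units, two descriptors; the second unit's output weight is negative
cells = color_map([[0.8, -0.8], [0.5, 0.5]], [1.0, -1.0])
```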
All networks were of the back-propagation type and
trained on a Silicon Graphics Workstation. The program AUTONET is written in
the language C. The networks were trained to predict activity. A hyperbolic
tangent transfer function was used with a user definable learning rate
coefficient between 0.1 and 1.0. All inputs were normalized between 0.1 and
0.9. Multiple networks, at least 3, were developed for each level of hidden
units specified. The number of hidden units was predetermined at discrete
levels of 1, 3, 5, 7, 9, 11, and 21, depending on the number of compounds in the
dataset. Presentation of inputs to the ANN was randomized after every epoch.
Each network was also started from randomized initial weights in order to start
each one at a different point on the response surface. Results from all of the
networks were compared. The total number of networks built is two times the
product of the number of passes and the number of hidden unit levels. This is
the combined total of networks constructed with the complete set of descriptors
and the number of networks built with the reduced set of descriptors. For example,
30 networks are built for each training/test set with 3 passes and 5 hidden
units18.
In a typical neural network application, the dataset is
randomly divided into two subsets. One group, the larger of the two, is used to
train the network while the smaller subset is used to evaluate the predictive
power of the network. QSAR datasets are typically small in the early stages of
the project and thus it becomes impractical to reduce them substantially. In
the present study, the datasets were sorted by activity and exemplars were
removed for testing purposes by one of two methods. The first method requires
the removal of compounds from the sorted dataset at predetermined intervals.
The second method, used for smaller datasets, is a leave-one-out procedure
requiring the removal of each compound, one at a time, to serve as the test
case. In each case the observation(s) that was removed served as the test case
for the network. If the datasets were not sorted before these techniques were
applied, the learning power of the resulting neural network was compromised.
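The two test-set constructions can be sketched as follows; the compound labels and activity values are invented for illustration.

```python
def every_fourth_split(compounds, activities):
    """Method one: sort by activity, then place every fourth record in
    the test set and the remainder in the training set."""
    ranked = [c for _, c in sorted(zip(activities, compounds),
                                   key=lambda pair: pair[0])]
    test = ranked[3::4]                              # records 4, 8, 12, ...
    train = [c for i, c in enumerate(ranked) if (i + 1) % 4 != 0]
    return train, test

def leave_one_out(compounds):
    """Method two, for smaller datasets: each compound serves once as
    the test case while the rest form the training set."""
    for i, held_out in enumerate(compounds):
        yield [c for j, c in enumerate(compounds) if j != i], [held_out]

train, test = every_fourth_split(list("ABCDEFGH"), [8, 7, 6, 5, 4, 3, 2, 1])
splits = list(leave_one_out(["x", "y", "z"]))
```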
In order to determine if our method would allow the
neural networks to merely memorize data, even random noise, we conducted the
following experiment. Datasets were created containing 20 "compounds"
and 10, 20, 40, or 80 descriptors of random numbers. The output representing
biological activity was a random number. A network with 13 descriptors (10
descriptors plus 3 random noise descriptors) and 1 hidden unit has 14 adjustable
parameters. The same dataset with 21 hidden units has 274 adjustable parameters,
and thus a definite potential to overfit the data exists. With the current
methodology using the leave-one-out protocol, all the networks that were created
consistently memorized the data, as evidenced by three characteristics: 1) a low
training error on the learning set and a high testing error on the test set; 2)
generally (>75%) fewer than 100 epochs in each network; and 3) hidden weight
coefficients which, when examined in the color maps, were similar to those of
known random noise descriptors.
Dataset 1 (Selwood Dataset): A critical step prior to the construction of a neural
network is the selection of the appropriate size of the training set and the
test set. Care must be taken to ensure that each set is representative of the
other. The training set should be as large as possible in order to provide the
neural network with the best opportunity to learn. The test set should be
sufficiently large to provide new cases in order to fairly evaluate the neural
network. The Selwood dataset was sorted according to
the output response and every fourth record was removed. This created a test
set of 31/4 = 7 records (records 4, 8, 12, 16, 20, 24, 28) and a training set
of 24 records. Training sets were randomized before the first epoch and before
each subsequent epoch. Three descriptors of random numbers were added, the
learning rate was set at 0.6, and a total of 15 networks were generated with
all of the descriptors. Three networks were generated for each hidden unit
level of 1, 3, 5, 7, and 9. The results are depicted using 3 colors
(positive weight coefficients in red and negative weight coefficients in blue).
The left side of the panel represents the networks that include all of the
descriptors (one descriptor per row and one network per column). Each network
is given a brightness which is proportional to its testing error and thus the
darker appearing columns are those networks with the largest testing error. A
few networks failed to find any descriptor more important than another; as
evidenced by a green vertical bar. However, the remaining networks had low
testing errors and found one or more descriptors to be consistently important
to the learning behavior. The right side of the panel represents 15 new
networks using only those descriptors that frequently provided weights greater
than any of the three random number descriptors. The results indicate that the
learning behavior of the ANNs was most often related to several descriptors.
The descriptors ATCH4, ESDL3 and CLOGP were identified as important to the
learning of the ANN and had positive weight coefficients while ATCH6, DIPV_X,
DIPV_Z, NSDL1, and NSDL7 were important but had negative weight coefficients19.
Regardless of the learning behavior of the individual
neural network, useful information is readily conveyed by the color panels.
Notably the display illustrates that not all network configurations behave
similarly. Some networks failed to learn; as evidenced by the vertical green
columns. The networks with the smaller number of hidden units, 1 and 3 located
at the left edge of the panel, appear to have the best learning results. The r2
training and r2 testing values in the output file suggest that these networks
were predictive, although this is not required of the AUTONET-derived neural
network, nor is it the focus of our interest. Due to our deliberate
undertraining of the neural network, it is probable that this method found
local minima which may be responsible for the results. Since each network starts from
a different point on the response surface, these minima may or may not be the
same. The actual predictive power of ANN derived from local minima may be
suspect. The critical observation is that the solutions to these local minima
are derived from a similar set of descriptors. The interpretation of the
learning behavior of the network is possible due to the commonality of the
solutions (e.g., the important descriptors for learning) to these local minima
from the entire collection of multiple networks20.
As reported previously, multiple regression analysis may then be applied to the
dataset using the most important descriptors identified by the neural network.
The Selwood dataset has been the
subject of analysis by numerous approaches. These studies illustrate a general
point in model building that many models exist to explain a dataset. The
purpose of the exercise is to find reasonable models upon which to base
additional experiments.
Dataset 2 (Dunn Dataset): This was a small dataset of only 13 compounds and 5
descriptors as originally reported. We added 58 descriptors in order to better
represent a realistic situation in which no descriptor bias is assumed. Since
this dataset was small, we used a true cross-validation technique to train the
neural networks, in which every compound is removed once to serve as the test
case. For each network 3 descriptors of random data were added and the
learning rate set equal to 0.6. The color scheme for these plates is the same
as described for Dataset 1. The results are more complex than the Selwood results as more networks are produced. The results
following the removal of one compound to serve as the test case are depicted in
a set of 15 columns representing the 15 networks (3 passes x 5 sets of hidden
units). The results in the panel are those from a total of 195 networks. The
interpretation is further complicated by the fact that the first and last
columns are black indicating the network was not able to train properly when
the observations at either extreme are removed to serve as the test cases. This
behavior is indicative of a model from which extrapolations are not possible.
Similarly, darker columns within the body of the color plate indicate those
compounds whose removal produced less satisfactory network training. However
useful information becomes apparent from the color plates. The sigma descriptor
and its component Swain-Lupton R descriptor appear important and have a negative
weight coefficient. Thus one may conclude that the electronic effects of the
substituent may be important to the biological activity of these compounds.
This is in agreement with the results from regression analysis reported in
previous studies21.
Dataset 3 (Howbert Dataset): The largest dataset studied contained 47 compounds.
Unlike the previous examples, the dataset represents a classification problem
(active/inactive) based upon the in vivo biological potency. Using the same
methodology as described, the neural networks were trained to identify features
for the correct classification of the compound. Active compounds were
designated with the number 1 and inactive compounds were designated with the
number 0 according to the definition provided by Howbert et al.22 Three neural
networks were trained at each of the hidden unit levels (1, 3, 5, 7, 9, 11, 21)
following the addition of 3 descriptors of random numbers, with a learning rate
equal to 0.6.
It is apparent from the results that the neural networks strongly identified
VDWVOL, with a negative weight coefficient, and pi, with a positive weight
coefficient, as descriptors important in the correct classification of the
compounds into active or inactive groups. Both of these descriptors were
identified by Howbert et al. using cluster
significance analysis of their data. Multiple regression analysis did not yield
a statistically valid model. Examination of the training and testing errors
indicates, as expected, that the neural network was not predictive23.
The applications of neural networks have typically required large datasets and
extensive training periods in order to achieve a predictive solution. We have
shown, using the techniques in the neural network program AUTONET, that it is
possible to extract information using a neural network from relatively small
datasets and from networks that are not statistically predictive. The AUTONET
method uses a series of multiple, short-training neural networks to provide
local minima as solutions. The information content is extracted from the
coefficients of the hidden weights associated with the input descriptors, with
the overall solution provided by a consensus of solutions to the local minima.
We found that, in spite of short training periods, the neural network memorized
random data very quickly. However, a characteristic profile of an overstrained
network was identified: 1) a low training error on the learning set and a high
testing error on the test set; 2) generally (>75%) fewer than 100 epochs in
each network; and 3) hidden weight coefficients similar to each other and
similar to the coefficients of known random noise descriptors. The addition of
three descriptors of random numbers allows for the establishment of a level
from which to judge background noise and chance correlation. The focus of the
AUTONET approach is not to achieve a
predictive solution. We have attempted to broaden the scope of the utility of
neural networks in QSAR by gaining information in the absence of a predictive
solution. An often encountered difficulty with neural networks is their lack of
interpretation. The AUTONET approach addresses this through the visual display
of the hidden unit weights and thus rapidly conveys useful and informative
results to the user.
REFERENCES:
1. Cangelosi, Angelo; Parisi, Domenico (1998), Emergence of language in an evolving population of neural networks. Connection Science, 10(2), 83-97.
2. Huston, S. (1995), Integrated Strategies in Drug Discovery, 23, 19-21.
3. Ferrer i Cancho, Ramon; Solé, Ricard V. (2002), Zipf's law and random texts. Advances in Complex Systems, 5(1), 1-6.
4. Miller, George A. (1957), Some effects of intermittent silence. American Journal of Psychology, 70, 311-314.
5. Kirby, Simon (2001), Spontaneous evolution of linguistic structure: an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation, 5(2), 102-110.
6. Boyd, D.; Seward, C. M. QSAR: Rational Approaches to the Design of Bioactive Compounds. Silipo, C.; Vittoria, A., Eds.; Elsevier Science Publishers B. V.: Amsterdam, 1991; 167-170.
7. Hansch, C.; Muir, R. M.; Fujita, T.; Maloney, P. P.; Geiger, F.; Streich, M. (1963), The Correlation of Biological Activity of Plant Growth Regulators and Chloromycetin Derivatives with Hammett Constants and Partition Coefficients. J. Am. Chem. Soc., 85, 2817-2824.
8. Dunn, W. J.; Wold, S.; Edlund, U.; Hellberg, S.; Gasteiger, J. (1984), Quant. Struct.-Act. Relat., 3, 131-137.
9. Topliss, J. G.; Edwards, R. P. (1979), Chance Factors in Studies of Quantitative Structure-Activity Relationships. J. Med. Chem., 22, 1238-1244.
10. McCulloch, W. S.; Pitts, W. (1943), A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Bio., 5, 115-133.
11. Hoskins, J. C.; Himmelblau, D. M. (1988), Artificial Neural Network Models of Knowledge Representation in Chemical Engineering. Comput. Chem. Eng., 12, 881-890.
12. (a) Qian, N.; Sejnowski, T. J. (1988), Predicting the Secondary Structure of Globular Proteins Using Neural Network Models. J. Mol. Biol., 202, 865-884. (b) Bohr, H.; Bohr, J.; Brunak, S.; Cotterill, R.; Lautrup, B.; Norskov, L.; Olsen, O.; Petersen, S. (1988), Protein Secondary Structure and Homology by Neural Networks. FEBS Lett., 241, 223-228.
13. T. Rives, S. S. (1994), Prediction of Atomic Ionization Potentials I-III Using an Artificial Neural Network. J. Chem. Inf. Comput. Sci., 34, 617-620.
14. Rumelhart, D. B. Parallel Distributed Processing. Feldman, J. A.; Hayes, P. J.; Rumelhart, D. B., Eds.; The MIT Press: London, 1982; 1, 318-363.
15. Aoyama, T.; Suzuki, Y.; Ichikawa, H. (1990), Neural Networks Applied to Structure-Activity Relationships. J. Med. Chem., 33, 905-908.
16. T. A.; Kalayeh, H. (1991), Application of Neural Networks. J. Med. Chem., 34, 2824-2836.
17. Richards, W. G. (1992), Application of Neural Networks: Quantitative Structure-Activity Relationships of the Derivatives of 2,4-Diamino-5-(substituted-benzyl)pyrimidines as DHFR Inhibitors. J. Med. Chem., 35, 3201-3207.
18. Manallack, D. T.; Ellis, D. D.; Livingstone, D. J. (1994), Analysis of Linear and Nonlinear QSAR Data Using Neural Networks. J. Med. Chem., 34, 3758-3767.
19. Gakh, A. A.; Gakh, E. R.; Sumpter, B. G.; Nord, D. W. (1994), Neural Network-Graph Theory Approach to the Prediction of the Physical Properties of Organic Compounds. J. Chem. Inf. Comput. Sci., 34, 832-839.
20. Wikel, J. H.; Dow, E. R. (1993), The Use of Neural Networks for Variable Selection in QSAR. Bioorg. Med. Chem. Lett., 3, 645-651.
21. Manallack, D. T.; Livingstone, D. J. (1995), Neural Networks and Expert Systems in Molecular Design. Methods Princ. Med. Chem., 3 (Advanced Computer-Assisted Techniques in Drug Discovery), 293-318.
22. Selwood, D. L.; Livingstone, D. J.; Comley, J. C.; O'Dowd, A. B.; Hudson, A. T.; Jackson, P.; Jandu, K. S.; Rose, V. S.; Stables, J. N. (1990), Structure-Activity Relationships of Antifilarial Antimycin Analogues: A Multivariate Pattern Recognition Study. J. Med. Chem., 33, 136-142.
23. Dunn, W. J.; Greenberg, M. J.; Callejas, S. S. (1976), Use of Cluster Analysis in the Development of Structure-Activity Relations for Antitumor Triazenes. J. Med. Chem., 19, 1299-1301.
Received on 26.11.2010; Accepted on 20.12.2010
© A&V Publication. All rights reserved.
Research J. Science and Tech. 3(1): Jan.-Feb. 2011: 17-24